Experimental Min Nan extraction function #397

Merged
merged 8 commits into from
Mar 27, 2021

Collaborator

@lfashby lfashby commented Mar 27, 2021

  • Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.

This adds an (incomplete) Min Nan extraction function in the hopes that it will nudge us toward developing solutions to #259 and #329. This extraction function currently only targets entries from the Hokkien 'dialect' (just because it seemed the most prevalent) and a user can then specify 'subdialects' of Hokkien with --dialect. I've added some data with (sub)dialect set as Xiamen.

To improve/expand the coverage of this extraction function we need to settle on a solution to #329. One solution might be to add a --subdialect option and have --dialect be used for Hokkien/Teochew and --subdialect for the nested dialects like Xiamen/Taipei (so for Portuguese we could set --dialect as Brazil and --subdialect as Paulista/South Brazil). This would be easy to implement but would probably be confusing to users. It would also require users (or people running the big scrape) to scrape the same language many separate times if they wanted data from all the dialects and subdialects of that language in separate TSVs.
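To make the --subdialect idea concrete, here is a rough sketch of how the two flags might combine as filters. It assumes each entry carries a flat list of dialect labels. The parser, the `matches` helper, and the --subdialect flag itself are hypothetical illustrations of the proposal, not WikiPron's actual CLI or internals.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Builds a toy parser; --subdialect is the proposed (not yet existing) flag."""
    parser = argparse.ArgumentParser(prog="sketch")
    parser.add_argument("language")
    parser.add_argument("--dialect")
    parser.add_argument("--subdialect")
    return parser


def matches(
    labels: List[str], dialect: Optional[str], subdialect: Optional[str]
) -> bool:
    """Returns True if the entry's labels satisfy every filter that was set.

    A filter left unset (None) matches every entry.
    """
    return all(
        wanted is None or wanted in labels for wanted in (dialect, subdialect)
    )


args = build_parser().parse_args(
    ["nan", "--dialect", "Hokkien", "--subdialect", "Xiamen"]
)
print(matches(["Hokkien", "Xiamen"], args.dialect, args.subdialect))
print(matches(["Hokkien", "Taipei"], args.dialect, args.subdialect))
```

Each invocation selects exactly one dialect/subdialect pair, which is why covering every combination would mean one full scrape per pair.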

An alternative solution would be to try and revamp our dialects system somewhat like so: If a user runs something like wikipron nan --dialects we run through the language once and write as many TSVs as there are dialects/subdialects in that language (easier said than done). If a user runs wikipron nan we run through the language once and they get one TSV containing all the entries from all the dialects/subdialects. I think there are different ways of doing this but the coolest would be to 'discover' the dialects/subdialects as we scrape a language and automatically write them to different TSVs.
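As a rough illustration of the 'discover as we scrape' idea, the sketch below makes a single pass over already-extracted entries and buckets them under whatever dialect labels turn up. The entry shape, the function names, and the "unspecified" catch-all bucket are assumptions made for this example, not part of WikiPron, and the sample rows are placeholders rather than real data.

```python
import collections
import csv
import io
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (word, pronunciation, dialect labels on the entry).
Entry = Tuple[str, str, Tuple[str, ...]]


def split_by_dialect(
    entries: Iterable[Entry],
) -> Dict[str, List[Tuple[str, str]]]:
    """Groups (word, pron) pairs under every dialect label seen in one pass."""
    buckets: Dict[str, List[Tuple[str, str]]] = collections.defaultdict(list)
    for word, pron, dialects in entries:
        # Entries without any dialect label go into a catch-all bucket.
        for dialect in dialects or ("unspecified",):
            buckets[dialect].append((word, pron))
    return dict(buckets)


def to_tsv(rows: List[Tuple[str, str]]) -> str:
    """Serializes one bucket as tab-separated text."""
    sink = io.StringIO()
    csv.writer(sink, delimiter="\t", lineterminator="\n").writerows(rows)
    return sink.getvalue()


# Placeholder rows; a real run would stream these from the scraper.
sample = [
    ("w1", "p1", ("Xiamen", "Taipei")),
    ("w2", "p2", ("Xiamen",)),
    ("w3", "p3", ()),
]
for name, rows in split_by_dialect(sample).items():
    print(name, repr(to_tsv(rows)))
```

Without the flag, the same pass could simply write every (word, pron) pair to one TSV, so both modes would need only a single scrape of the language.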

Collaborator

@kylebgorman kylebgorman left a comment

This looks good to me.

I am fine with the --subdialect proposal but agree that --dialects is a superior solution. It should also reduce our need to micromanage this all, right?

Collaborator

@jacksonllee jacksonllee left a comment

LGTM. Kudos to you, Lucas, for working on these challenges! Understood that this Min Nan scrape was arbitrarily set for Hokkien for now, and is subject to change in future PRs based on how we want to handle subdialects.

The --dialects proposal does sound great (might need another flag name -- too easily confused with the existing --dialect), but I see implementation may be tricky (which Lucas has already alluded to). So far we've seen the Chinese-styled and Brazilian Portuguese-styled subdialect formatting on Wiktionary. Are there other flavors we haven't come across yet?

Comment on lines +1272 to +1275
"zyyy": "Common",
"latn": "Latin",
"hira": "Hiragana",
"hani": "Han"
Collaborator

I need a quick reminder -- where do these come from again?

Collaborator Author

From our new languages_update.py postprocessing step.

Collaborator Author

lfashby commented Mar 27, 2021

> It should also reduce our need to micromanage this all, right?

Certainly, though perhaps by introducing something even more demanding of micromanagement!

> So far we've seen the Chinese-styled and Brazilian Portuguese-styled subdialect formatting on Wiktionary. Are there other flavors we haven't come across yet?

Not that I'm aware of, though it'd definitely be best to go on a bit of a hunt for them before trying out this dialects approach.

@lfashby lfashby merged commit db093e6 into CUNY-CL:master Mar 27, 2021
@lfashby lfashby deleted the min branch September 13, 2021 19:24

3 participants